In systematic reviews, it is common for eligible studies to contribute effect size estimates from not just one, but multiple relevant outcome measures, all based on a common sample of participants. If those outcomes are correlated, then so too will be the effect size estimates. To estimate the degree of correlation, you would need the sample correlation among the outcomes—information that is woefully uncommon for primary studies to report (and best of luck to you if you try to follow up with author queries). Thus, the meta-analyst is often left in a situation where the sampling variances of the effect size estimates can be reasonably well approximated, but the sampling covariances are unknown for some or all studies.
Several solutions to this conundrum have been proposed in the meta-analysis methodology literature. One possible strategy is to just impute a correlation based on subject-matter knowledge (or at least feigned expertise), and assume that this correlation is constant across studies. This analysis could be supplemented with sensitivity analyses to examine the extent to which the parameter estimates and inferences are sensitive to alternative assumptions about the inter-correlation of effects within studies. A related strategy, described by Wei and Higgins (2013), is to meta-analyze any available correlation estimates and then use the results to impute correlations for any studies with missing correlations.
Both of these approaches require the meta-analyst to calculate block-diagonal sampling covariance matrices for the effect size estimates, which can be a bit unwieldy. I often use the impute-the-correlation strategy in my meta-analysis work and have written a helper function to compute covariance matrices, given known sampling variances and imputed correlations for each study. In the interest of not repeating myself, I’ve added the function to the latest version of my clubSandwich package. In this post, I’ll explain the function and demonstrate how to use it for conducting meta-analysis of correlated effect size estimates.
An R function for block-diagonal covariance matrices
Here is the function:
Code
library(clubSandwich)
impute_covariance_matrix
function (vi, cluster, r, ti, ar1, smooth_vi = FALSE, subgroup = NULL, 
    return_list = identical(as.factor(cluster), sort(as.factor(cluster))), 
    check_PD = TRUE) 
{
    cluster <- droplevels(as.factor(cluster))
    vi_list <- split(vi, cluster)
    if (smooth_vi) 
        vi_list <- lapply(vi_list, function(x) rep(mean(x, na.rm = TRUE), 
            length(x)))
    if (missing(r) & missing(ar1)) 
        stop("You must specify a value for r or for ar1.")
    if (!missing(r)) {
        r_list <- rep_len(r, length(vi_list))
        if (missing(ar1)) {
            vcov_list <- Map(function(V, rho) (rho + diag(1 - rho, 
                nrow = length(V))) * tcrossprod(sqrt(V)), 
                V = vi_list, rho = r_list)
        }
    }
    if (!missing(ar1)) {
        if (missing(ti)) 
            stop("If you specify a value for ar1, you must provide a vector for ti.")
        ti_list <- split(ti, cluster)
        ar_list <- rep_len(ar1, length(vi_list))
        if (missing(r)) {
            vcov_list <- Map(function(V, time, phi) (phi^as.matrix(stats::dist(time))) * 
                tcrossprod(sqrt(V)), V = vi_list, time = ti_list, 
                phi = ar_list)
        }
        else {
            vcov_list <- Map(function(V, rho, time, phi) (rho + 
                (1 - rho) * phi^as.matrix(stats::dist(time))) * 
                tcrossprod(sqrt(V)), V = vi_list, rho = r_list, 
                time = ti_list, phi = ar_list)
        }
        vcov_list <- lapply(vcov_list, function(x) {
            attr(x, "dimnames") <- NULL
            x
        })
    }
    if (!is.null(subgroup)) {
        si_list <- split(subgroup, cluster)
        subgroup_list <- lapply(si_list, function(x) sapply(x, 
            function(y) y == x))
        vcov_list <- Map(function(V, S) V * S, V = vcov_list, 
            S = subgroup_list)
    }
    if (check_PD) 
        check_PD(vcov_list)
    if (return_list) {
        return(vcov_list)
    }
    else {
        vcov_mat <- unblock(vcov_list)
        cluster_index <- order(order(cluster))
        return(vcov_mat[cluster_index, cluster_index])
    }
}
<bytecode: 0x000001d425d53050>
<environment: namespace:clubSandwich>
The function takes three required arguments:
* vi is a vector of sampling variances.
* cluster is a vector identifying the study from which effect size estimates are drawn. Effects with the same value of cluster will be treated as correlated.
* r is the assumed value(s) of the correlation between effect size estimates from each study. Note that r can also be a vector with separate values for each study.
Here is a simple example to demonstrate how the function works. Say that there are just three studies, contributing 2, 3, and 4 effects, respectively. I’ll just make up some values for the effect sizes and variances:
Code
dat <- data.frame(study = rep(LETTERS[1:3], 2:4), yi = rnorm(9), vi = 4:12)
dat
  study         yi vi
1     A  2.5441763  4
2     A  1.5374848  5
3     B -0.3374318  6
4     B  0.3957611  7
5     B -0.4592493  8
6     C  0.4457232  9
7     C -0.6421787 10
8     C -1.9565539 11
9     C -1.1833011 12
I’ll assume that effect size estimates from a given study are correlated at 0.7:
Code
V_list <- impute_covariance_matrix(vi = dat$vi, cluster = dat$study, r = 0.7)
V_list
The result is a list of matrices, where each entry corresponds to the variance-covariance matrix of effects from a given study. To see that the results are correct, let’s examine the correlation matrices implied by these covariance matrices:
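Code
cov2cor(V_list$A)
cov2cor(V_list$B)
cov2cor(V_list$C)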
As requested, effects are assumed to be equi-correlated with r = 0.7.
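Because the value of r is recycled across studies, you can also supply a separate imputed correlation for each study. Here is a minimal sketch, using hypothetical values for the three studies:
Code
# one imputed correlation per study (hypothetical values)
V_list_by_study <- impute_covariance_matrix(vi = dat$vi, cluster = dat$study,
                                            r = c(0.5, 0.6, 0.7))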
If the data are sorted in order of the cluster IDs, then the list of matrices returned by impute_covariance_matrix() can be fed directly into the rma.mv function in metafor (as I demonstrate below). However, if the data are not sorted by cluster, then feeding in the list of matrices will not work correctly. Instead, the full $N \times N$ variance-covariance matrix (where $N$ is the total number of effect size estimates) will need to be calculated so that the rows and columns appear in the correct order. To address this possibility, the function includes an optional argument, return_list, which determines whether to output a list of matrices (one matrix per study/cluster) or a single matrix corresponding to the full variance-covariance matrix across all studies. By default, return_list tests whether the cluster argument is sorted and returns the appropriate form. The argument can also be set directly by the user.
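The full matrix can also be requested directly, even when the data are sorted. A quick sketch (the object name is just for illustration):
Code
# force a single block-diagonal variance-covariance matrix instead of a list
V_full <- impute_covariance_matrix(vi = dat$vi, cluster = dat$study, r = 0.7,
                                   return_list = FALSE)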
Here’s what happens if we feed in the data in a different order:
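Code
dat_scramble <- dat[sample(nrow(dat)),]
dat_scramble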
  study         yi vi
1     A  2.5441763  4
3     B -0.3374318  6
2     A  1.5374848  5
8     C -1.9565539 11
6     C  0.4457232  9
7     C -0.6421787 10
4     B  0.3957611  7
9     C -1.1833011 12
5     B -0.4592493  8
Code
V_mat <- round(impute_covariance_matrix(vi = dat_scramble$vi, 
                                        cluster = dat_scramble$study, 
                                        r = 0.7), 3)
V_mat
To see that this is correct, check that the diagonal entries of V_mat are the same as vi:
Code
all.equal(dat_scramble$vi, diag(V_mat))
[1] TRUE
An example with real data
Kalaian and Raudenbush (1996) introduced a multivariate random effects model, which can be used to perform a joint meta-analysis of studies that contribute effect sizes on distinct, related outcome constructs. They demonstrate the model using data from a synthesis on the effects of SAT coaching, where many studies reported effects on both the math and verbal portions of the SAT. The data are available in the clubSandwich package:
Code
library(dplyr, warn.conflicts = FALSE)
data(SATcoaching)

# calculate the mean of log of coaching hours
mean_hrs_ln <- SATcoaching %>%
  group_by(study) %>%
  summarise(hrs_ln = mean(log(hrs))) %>%
  summarise(hrs_ln = mean(hrs_ln, na.rm = TRUE))

# clean variables, sort by study ID
SATcoaching <- SATcoaching %>%
  mutate(
    study = as.factor(study),
    hrs_ln = log(hrs) - mean_hrs_ln$hrs_ln
  ) %>%
  arrange(study, test)

SATcoaching %>%
  select(study, year, test, d, V, hrs_ln) %>%
  head(n = 20)
The correlation between math and verbal test scores is not available, but it seems reasonable to use a correlation of r = 0.66, as reported in the SAT technical information. To synthesize these effects, I’ll first compute the required variance-covariance matrices:
Code
V_list <- impute_covariance_matrix(vi = SATcoaching$V, cluster = SATcoaching$study, r = 0.66)
This can then be fed into metafor to estimate fixed effect or random effects meta-analysis and meta-regression models:
Code
library(metafor, quietly = TRUE)

# bivariate fixed effect meta-analysis
MVFE_null <- rma.mv(d ~ 0 + test, V = V_list, data = SATcoaching)
MVFE_null
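# bivariate fixed effect meta-regression
MVFE_hrs <- rma.mv(d ~ 0 + test + test:hrs_ln, V = V_list, data = SATcoaching)
MVFE_hrs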
# bivariate random effects meta-analysis
MVRE_null <- rma.mv(d ~ 0 + test, V = V_list, data = SATcoaching,
                    random = ~ test | study, struct = "UN")
MVRE_null
Multivariate Meta-Analysis Model (k = 67; method: REML)

Variance Components:

outer factor: study (nlvls = 47)
inner factor: test  (nlvls = 2)

            estim    sqrt  k.lvl  fixed   level
tau^2.1    0.0122  0.1102     29     no    Math
tau^2.2    0.0026  0.0507     38     no  Verbal

        rho.Math  rho.Vrbl    Math  Vrbl
Math           1                 -    20
Verbal   -1.0000         1      no     -

Test for Residual Heterogeneity:
QE(df = 65) = 72.1630, p-val = 0.2532

Test of Moderators (coefficients 1:2):
QM(df = 2) = 18.1285, p-val = 0.0001

Model Results:

            estimate      se    zval    pval   ci.lb   ci.ub
testMath      0.1379  0.0434  3.1783  0.0015  0.0528  0.2229   **
testVerbal    0.1168  0.0337  3.4603  0.0005  0.0506  0.1829  ***

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# bivariate random effects meta-regression
MVRE_hrs <- rma.mv(d ~ 0 + test + test:hrs_ln, V = V_list, data = SATcoaching,
                   random = ~ test | study, struct = "UN")
MVRE_hrs
Multivariate Meta-Analysis Model (k = 65; method: REML)

Variance Components:

outer factor: study (nlvls = 46)
inner factor: test  (nlvls = 2)

            estim    sqrt  k.lvl  fixed   level
tau^2.1    0.0152  0.1234     28     no    Math
tau^2.2    0.0014  0.0373     37     no  Verbal

        rho.Math  rho.Vrbl    Math  Vrbl
Math           1                 -    19
Verbal   -1.0000         1      no     -

Test for Residual Heterogeneity:
QE(df = 61) = 67.9575, p-val = 0.2523

Test of Moderators (coefficients 1:4):
QM(df = 4) = 23.6459, p-val < .0001

Model Results:

                   estimate      se    zval    pval    ci.lb   ci.ub
testMath             0.0893  0.0507  1.7631  0.0779  -0.0100  0.1887  .
testVerbal           0.1062  0.0357  2.9738  0.0029   0.0362  0.1762  **
testMath:hrs_ln      0.1694  0.0725  2.3354  0.0195   0.0272  0.3116  *
testVerbal:hrs_ln    0.0490  0.0459  1.0681  0.2855  -0.0409  0.1389

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of fitting this model using restricted maximum likelihood with metafor are actually a bit different from the estimates reported in the original paper, potentially because Kalaian and Raudenbush use a Cholesky decomposition of the sampling covariances, which alters the interpretation of the random effects variance components. The metafor fit is also a bit goofy because the correlation between the random effects for math and verbal scores is very close to -1, although evidently it is not uncommon to obtain such degenerate estimates of the random effects structure.
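Since the imputed correlation of 0.66 is itself an assumption, the sensitivity analysis strategy mentioned at the outset is straightforward to carry out here. A minimal sketch, assuming a grid of candidate correlations (the object names are just for illustration):
Code
# re-fit the meta-regression over a range of imputed correlations
r_grid <- seq(0.4, 0.9, by = 0.1)
sens_fits <- lapply(r_grid, function(r) {
  V_r <- impute_covariance_matrix(vi = SATcoaching$V, cluster = SATcoaching$study, r = r)
  rma.mv(d ~ 0 + test + test:hrs_ln, V = V_r, data = SATcoaching,
         random = ~ test | study, struct = "UN")
})

# compare coefficient estimates across the assumed correlations
sapply(sens_fits, coef)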
Robust variance estimation
Experienced meta-analysts will no doubt point out that a further alternative to the strategy described above would be to use robust variance estimation methods (RVE; Hedges, Tipton, & Johnson, 2010). However, RVE is not so much an alternative strategy as it is a complementary technique, which can be used in combination with any of the models estimated above. Robust standard errors and hypothesis tests can readily be obtained with the clubSandwich package. Here’s how to do it for the random effects meta-regression model:
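Code
library(clubSandwich)
coef_test(MVRE_hrs, vcov = "CR2")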
RVE is also available in the robumeta R package, but there are several differences between the implementation there and the method I’ve demonstrated here. From the user’s perspective, an advantage of robumeta is that it does all of the covariance imputation calculations “under the hood,” whereas with metafor the calculations need to be done prior to fitting the model (see the sketch after the list below). Beyond this, differences include:
* robumeta uses a specific random effects structure that can’t be controlled by the user, whereas metafor can be used to estimate a variety of different random effects structures;
* robumeta uses a moment estimator for the between-study variance, whereas metafor provides full maximum likelihood (FML) or restricted maximum likelihood (REML) estimation;
* robumeta uses semi-efficient, diagonal weights when fitting the meta-regression, whereas metafor uses weights that are fully efficient (exactly inverse-variance) under the working model.
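For a rough comparison, here is a sketch of how a similar model might be specified in robumeta, with its correlated-effects weights and the rho argument playing the role of the imputed correlation. This is an illustration rather than an exact equivalent of the metafor model above, and it assumes complete cases on the moderator:
Code
library(robumeta)

# drop rows with missing coaching hours, since this is just a sketch
SAT_complete <- subset(SATcoaching, !is.na(hrs_ln))

# correlated-effects working model with an imputed correlation of 0.66
robu_hrs <- robu(d ~ test + test:hrs_ln, data = SAT_complete,
                 studynum = study, var.eff.size = V,
                 modelweights = "CORR", rho = 0.66)
print(robu_hrs)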
The advantages and disadvantages of these two approaches involve some subtleties that I’ll get into in a future post.